03 Trimming and Filtering

Assessing Read Quality

Time
  • Teaching: 30 min
  • Exercises: 25 min
Questions
  • How can we get rid of sequence data that does not meet our quality standards?
Objectives
  • Clean FASTQ reads using Trimmomatic
  • Interpret a FastQC plot summarizing per-base quality across all reads
Keypoints
  • The options you set for the software you use are important!
  • Data cleaning is essential at the beginning of metagenomics workflows
  • Use Trimmomatic to get rid of adapters and low-quality bases or reads
  • Carefully fill in the parameters and options required to run the software

Cleaning reads

In the last episode, we took a high-level look at the quality of each of our samples using FastQC. We visualized per-base quality graphs showing the distribution of quality at each base position across all the reads from a sample, and we saw which samples failed which quality checks. This information helps us decide on the quality threshold we will accept. Some of our samples failed quite a few of the quality metrics used by FastQC. However, this does not mean that our samples should be thrown out! It is common for some quality metrics to fail, and this may or may not be a problem for your downstream application. For our workflow, we will remove some low-quality sequences to reduce the false-positive rate caused by sequencing errors.

To accomplish this, we will use a program called Trimmomatic. This tool filters out poor-quality reads and trims poor-quality bases from the specified samples.

Trimmomatic options

Trimmomatic has a variety of options to accomplish its task.

Option          Meaning
ILLUMINACLIP    Cut adapter and other Illumina-specific sequences from the read.
SLIDINGWINDOW   Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
MINLEN          Drop the read if it is below a specified length.
LEADING         Cut bases off the start of a read, if below a threshold quality.
TRAILING        Cut bases off the end of a read, if below a threshold quality.
CROP            Cut the read to a specified length.
HEADCROP        Cut the specified number of bases from the start of the read.
AVGQUAL         Drop the read if the average quality is below a specified value.
MAXINFO         Trim reads adaptively, balancing read length and error rate to maximise the value of each read.
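
To make the simpler options more concrete, here is a minimal Python sketch of what LEADING, TRAILING and MINLEN do to a single read, following the descriptions in the table. It is an illustration only, not Trimmomatic's own code: the function names, the example read and the threshold values are all made up for this example.

```python
def leading_cut(quals, threshold):
    """Index of the first base whose quality is at or above the threshold."""
    start = 0
    while start < len(quals) and quals[start] < threshold:
        start += 1
    return start


def trailing_cut(quals, threshold):
    """Index just past the last base whose quality is at or above the threshold."""
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return end


def clean_read(seq, quals, leading_q=3, trailing_q=3, minlen=5):
    """Apply LEADING, TRAILING and MINLEN in that order (illustrative values).

    Returns the trimmed (sequence, qualities) pair, or None if the read is dropped.
    """
    start = leading_cut(quals, leading_q)
    end = trailing_cut(quals, trailing_q)
    if end <= start:
        return None  # every base was below the threshold
    seq, quals = seq[start:end], quals[start:end]
    if len(seq) < minlen:
        return None  # MINLEN: the read is too short after trimming
    return seq, quals


# Hypothetical read with two poor bases at each end.
example = clean_read("ACGTACGTAC", [2, 2, 30, 30, 30, 30, 30, 30, 2, 2])
print(example)
```

Note that the order of the steps matters: a read that passes MINLEN before trimming may fail it afterwards, which is why length filters are usually placed last.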

First, we must specify whether our reads are paired-end, single-end or a paired-end collection. Next, we will specify whether to perform ILLUMINACLIP. For our reads, we want to perform adapter removal using TruSeq3, and we can use the default parameters. Next, we will choose which “Trimmomatic Operation” we want to use. You can use multiple operations, but we will use only the SLIDINGWINDOW operation, averaging across 4 bases and requiring an average quality score of 20.
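
The idea behind SLIDINGWINDOW with a window of 4 bases and a required average quality of 20 can be sketched in a few lines of Python. This is a simplified illustration, not Trimmomatic's implementation: it assumes Phred+33 encoded quality strings and cuts at the start of the first window whose average falls below the threshold, whereas the real tool may place the cut slightly differently within that window. The function names and the example quality string are hypothetical.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def sliding_window_cut(quals, window=4, min_avg=20):
    """Return how many bases to keep from the start of the read.

    The window moves one base at a time from the 5' end. The read is cut
    at the start of the first window whose mean quality is below min_avg.
    """
    if not quals:
        return 0
    # Reads shorter than the window are treated as a single window.
    last_start = max(len(quals) - window, 0)
    for start in range(last_start + 1):
        win = quals[start:start + window]
        if sum(win) / len(win) < min_avg:
            return start
    return len(quals)


# "?" decodes to 30 and "+" decodes to 10 under Phred+33.
quals = phred33("?????++++")
keep = sliding_window_cut(quals, window=4, min_avg=20)
print(keep)  # 4
```

In this example the window starting at the fifth base has an average of 15, so the read is cut there, even though that base itself has a quality of 30. This is why a sliding window removes slightly more than a simple per-base threshold would.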

Although we will use only a few options and trimming steps in our analysis, understanding the steps you are using to clean your data is essential. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.

Overview of Trimmomatic Steps

Running Trimmomatic

Now that we have an understanding of the parameters we can use with Trimmomatic, we can run the tool. Make sure to select the Output trimmomatic log messages option, as this will provide useful information about what Trimmomatic actually did.

Select Paired-end (two separate input files)
Figure 1: Select Paired-end

ILLUMINACLIP
Figure 2: Settings for ILLUMINACLIP

SLIDINGWINDOW
Figure 3: Settings for SLIDINGWINDOW

Once Trimmomatic completes, you will have 5 outputs:

  • Trimmomatic on X data (R1 paired)
  • Trimmomatic on Y data (R2 paired)
  • Trimmomatic on X data (R1 unpaired)
  • Trimmomatic on Y data (R2 unpaired)
  • Trimmomatic on X and Y data (log file)

The reads we are interested in for this analysis are the paired outputs and we are also interested in the log file.
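
For readers curious how these settings map onto the command-line version of Trimmomatic, the sketch below builds (but does not run) a roughly equivalent paired-end command in Python. Several things here are assumptions rather than part of this lesson: the jar name, all file names, and the ILLUMINACLIP values 2:30:10, which are taken from the commonly cited example in the Trimmomatic manual and are assumed to match the defaults used above. The four output files correspond to the four read outputs listed above.

```python
def trimmomatic_pe_command(r1, r2, prefix,
                           adapters="TruSeq3-PE.fa",
                           jar="trimmomatic.jar"):
    """Build a paired-end Trimmomatic command as a list of arguments.

    All paths are hypothetical placeholders. The output order expected by
    Trimmomatic in PE mode is: forward paired, forward unpaired,
    reverse paired, reverse unpaired.
    """
    return [
        "java", "-jar", jar, "PE",
        r1, r2,
        f"{prefix}_R1_paired.fastq.gz",
        f"{prefix}_R1_unpaired.fastq.gz",
        f"{prefix}_R2_paired.fastq.gz",
        f"{prefix}_R2_unpaired.fastq.gz",
        # adapter file : seed mismatches : palindrome threshold : simple threshold
        f"ILLUMINACLIP:{adapters}:2:30:10",
        # window size : required average quality
        "SLIDINGWINDOW:4:20",
    ]


cmd = trimmomatic_pe_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample")
print(" ".join(cmd))
```

Building the command as a list keeps each argument separate, so it could later be passed to `subprocess.run` without shell quoting problems, provided Java and the Trimmomatic jar are actually installed.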

Exercise 1: What did Trimmomatic do?

Use the output from your Trimmomatic command to answer the following questions.

  1. What percentage of reads did we discard from our sample?
  2. For what percentage of reads did we keep both pairs?

Use the log file to answer these questions. Look for the Dropped and Both Surviving values. For Sample_108 this gives the following:

  1. 0.00%
  2. 99.49%
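
If you have many samples, reading each log by eye becomes tedious. The sketch below shows one way to pull the counts out of the paired-end summary line with Python. It assumes the summary line has the usual "Input Read Pairs: ... Both Surviving: ... Dropped: ..." layout, and the numbers in the example line are invented for illustration; they are not the values for Sample_108.

```python
import re

LABELS = (
    "Input Read Pairs",
    "Both Surviving",
    "Forward Only Surviving",
    "Reverse Only Surviving",
    "Dropped",
)


def parse_summary(line):
    """Extract the read-pair counts from a Trimmomatic paired-end summary line."""
    counts = {}
    for label in LABELS:
        match = re.search(re.escape(label) + r":\s*(\d+)", line)
        if match is None:
            raise ValueError(f"'{label}' not found in summary line")
        counts[label] = int(match.group(1))
    return counts


def percent(part, total):
    """Percentage of total, returning 0.0 for an empty input."""
    return 100.0 * part / total if total else 0.0


# Made-up example line, not real output from this lesson's data.
line = ("Input Read Pairs: 1000 Both Surviving: 900 (90.00%) "
        "Forward Only Surviving: 60 (6.00%) "
        "Reverse Only Surviving: 30 (3.00%) Dropped: 10 (1.00%)")

counts = parse_summary(line)
total = counts["Input Read Pairs"]
print(f"Dropped: {percent(counts['Dropped'], total):.2f}%")
print(f"Both surviving: {percent(counts['Both Surviving'], total):.2f}%")
```

Recomputing the percentages from the counts, rather than reading them from the brackets, gives a quick check that the four categories add up to the number of input read pairs.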

Editing Dataset Attributes

You have probably noticed by now that the names of our files are becoming long and difficult to decipher. We should therefore edit the attributes of our datasets to give them more descriptive names.

To do this, click the “pencil” icon and edit the name, then click “Save”.

Edit attribute

Checking the Impact of Trimmomatic

To assess the impact Trimmomatic had on our reads, we can rerun FastQC.

(a) Metabarcoding Sample Before Trimmomatic

(b) Metabarcoding Sample After Trimmomatic

Figure 4: FastQC Comparison of Metabarcoding Sample

(a) Metashotgun Sample Before Trimmomatic

(b) Metashotgun Sample After Trimmomatic

Figure 5: FastQC Comparison of Metashotgun Sample

From the comparison we can see that the metabarcoding sample has X has